Abstract. When testing multiple hypotheses in a survey (e.g. many different source
locations, template waveforms, and so on), the final result consists of a set of confidence
intervals, each one at a desired confidence level. But the probability that at least one of
these intervals does not cover the true value increases with the number of trials. With
a sufficiently large array of confidence intervals, one can be practically sure that at least
one is missing the true value. In particular, the probability of a false claim of detection
becomes non-negligible. In order to compensate for this, one should increase the confidence
level, at the price of reduced detection power. False discovery rate control [1] is a relatively
new statistical procedure that bounds the number of mistakes made when performing
multiple hypothesis tests. We shall review this method, discussing example applications
to the field of gravitational wave surveys.
1. Introduction

The motivation for controlling the false discovery rate (FDR), i.e. the fraction of
false alarms in a collection of candidate detections, came to our attention while we
were involved in data analysis for the IGEC [3], the network of resonant detectors
that searched for coincident burst gravitational wave (GW) signals in the years 1997-2000.
Even though the detectors involved in IGEC were rather similar, there were obvious
configurations (special choices of detector pairs, three-fold instead of double coincidence)
or cuts of the data (higher or lower threshold on event amplitude) characterized by
lower background counts or higher duty time. We had no good a priori reason
to prefer one configuration or cut over the others, as we did not know a priori the
intensity of the signal, and hence the efficiency. Therefore, we decided at the outset on a
fairly long list of interesting choices, in order to perform many analyses in parallel and
eventually to quote the results for each trial.
Table 1. Quick-reference notation chart for the variables used in section 2. $m$ is
the total number of performed tests (trial factor); $m_0$ and $m_1$ are the actual numbers of
underlying off-source and on-source tests. The number of tests that turn out positive is $R$,
given by $S$ true positives and $B$ spurious claims. An ideal experiment would neither
treat background as signal (type I error) nor do the reverse (type II error).

                        Null retained             Null rejected                  Total
                        (cannot reject)           (i.e. accept alternative)
H_0 true (background)   m_0 - B                   B (type I error)               m_0
H_1 true (signal)       m_1 - S (type II error)   S (detected signals)           m_1
Total                   m - R                     R = B + S                      m
                                                  (reported signal candidates)



The results were expressed as confidence intervals on the expectation value of the
number of coincidence counts due to GW. When the final results were unveiled, one of
the confidence intervals at 90% coverage did not include the null hypothesis (i.e. zero
counts). Of course, this can be expected to happen by chance when the number of trials
is very high. We could compute accurately that there was a 30% probability that
at least one of the tests would falsely reject the null hypothesis.

The probability of at least one false claim in a set of trials is known as the family-wise
error rate (FWER). It is not difficult to devise a method to control this quantity before
looking at the results: we just have to increase the confidence in the single trial (say 99%,
or 99.99% coverage) in order to keep the FWER much lower than one. The drawback is
that the resulting confidence intervals would be much wider, and consequently the power
of the search would fall dramatically. This is a consequence of the requirement that the
null hypothesis must not be rejected when it is true, not even in a single case.

A very reasonable compromise was suggested by Benjamini and Hochberg [1]. They
remark that in many practical cases, when having one or more false claims is not by itself
unacceptable, we could be satisfied if, on average, most of the claims were real. In
other words, they propose to bound the FDR instead of the FWER.

There are many topics in GW searches which would benefit from this kind of
procedure. For instance:

• all-sky surveys: many source directions and polarizations are tried in parallel;

• template banks;

• eyes-wide-open searches: many alternative analysis pipelines, with different
amplitude thresholds, signal durations, and so on, are applied to the same data;

• periodic updates of results: every new science run is a chance for a "discovery"
("Maybe the next one is the good one");

• many graphical representations or aggregations of the data ("With a slight change
in the binning, the 'signal' shows up better").
This work is not meant to be a complete review of state-of-the-art techniques for
FDR control, but we hope it will be a stimulus for whoever is involved in multiple-test
data analysis issues.

In the following sections, we shall use the notation reported in table 1. 

2. Description of the method 

2.1. Preliminary remarks 

In order to decide whether the results of a measurement are compatible with being
generated by noise only (null hypothesis, $H_0$) or instead contain a signal (alternative
hypothesis, $H_1$), the textbook procedure is to set up a test statistic $t$ from the measurements
themselves. If $F_0(t)$ is the distribution of $t$ when $H_0$ holds, then the p-value of an
observed value $t^*$ is defined as $p = F_0(t^*) = \Pr(t > t^* \mid H_0)$. By construction,
$p$ is uniformly distributed between 0 and 1 when $H_0$ holds:

    $\Pr(p < \tilde{p} \mid 0 < \tilde{p} < 1) = \tilde{p}$    (1)

It is of paramount importance that the distribution $F_0$ is known. It is always wise to
check a priori models with a goodness-of-fit test when enough off-source data are
available. This is not always the case, but there are often surrogate procedures (e.g.
data permutation) which give fresh independent samples of the background process
while removing the effect of real signals, if any are present in the data. For
instance, in the case of IGEC, the resampling procedure consisted in adding a delay to
the time reference of one of the detectors in the network, such that any coincident signal
is lost while the background expectation value of the coincidence counts is approximately
unchanged. If the data do not comply with the model, at worst the resampled data
may allow $F_0$ to be estimated by an empirical fit.

As for $H_1$, it is usually unknown, but for our purposes it is sufficient to assume that
the signal can be distinguished from the noise, i.e.
$\Pr(p < \tilde{p} \mid 0 < \tilde{p} < 1) > \tilde{p}$. The
sketch in figure 1 (top left) illustrates the concept.

For a single hypothesis test, the condition "reject null if $p < \alpha$" leads to false
positives with probability $\alpha$. In the case of multiple tests, we deal with a set
$\mathbf{p} = \{p_1, p_2, \ldots, p_m\}$ of p-values, which need not derive from the same test
statistic, nor refer to the same tested null hypothesis; $m$ is called the trial factor. We select
discoveries using a threshold $T(\mathbf{p})$: "reject null if $p_j < T(\mathbf{p})$".

2.2. Controlling Type I errors (B)

Uncorrected testing would just use the same threshold for each test: $T(\mathbf{p}) = \alpha$. The
probability that at least one rejection is wrong then grows with the number of tests as
$P(B > 0) = 1 - (1 - \alpha)^m \approx m\alpha$, where the approximation holds for $m\alpha \ll 1$.

Therefore, as in the IGEC case, a false discovery becomes practically certain for $m$ large enough.

The other extreme solution, usually referred to as the Bonferroni procedure [4],
controls the FWER in the most stringent manner, by requiring that $P(B > 0) \le \alpha$.
This is achieved by the choice $T(\mathbf{p}) = \alpha/m$. While this approach makes mistakes rare,
the cost is low efficiency ($S \approx 0$).
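
As a hedged numerical illustration of the two extremes described above, the following Python sketch evaluates $P(B > 0)$ when all $m$ null hypotheses are true and the tests are independent (an assumption of this sketch). The function names and the values of $\alpha$ and $m$ are chosen here for illustration only and do not come from the IGEC analysis.

```python
def fwer_uncorrected(alpha: float, m: int) -> float:
    """P(B > 0) for m independent true-null tests, each with threshold alpha."""
    return 1.0 - (1.0 - alpha) ** m


def fwer_bonferroni(alpha: float, m: int) -> float:
    """P(B > 0) for the same tests when each threshold is alpha / m."""
    return 1.0 - (1.0 - alpha / m) ** m


if __name__ == "__main__":
    alpha = 0.01  # illustrative single-test threshold
    for m in (1, 10, 100, 1000):
        print(m, round(fwer_uncorrected(alpha, m), 4),
              round(fwer_bonferroni(alpha, m), 4))
```

With $\alpha = 0.01$ and $m = 100$, for example, the uncorrected probability of at least one false claim is about 0.63, while the Bonferroni threshold keeps it just below 0.01.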

2.3. Controlling false discovery rate (B/R) 

In order to trade off between $B = 0$ and $S = 0$, FDR control focuses on the ratio of
false discoveries to the total number of claims:

    $\mathrm{FDR} = \begin{cases} B/R = B/(B+S) & \text{if } R > 0 \\ 0 & \text{if } R = 0 \end{cases}$    (2)

This can be done with a proper choice of $T(\mathbf{p})$. The original procedure suggested
by Benjamini and Hochberg (BH) is extremely simple, involving only trivial algebraic
operations. It consists of the following steps:

• sort the p-values in ascending order: $\{p_1, p_2, \ldots, p_m \mid i < j \Rightarrow p_i \le p_j\}$;

• choose your desired FDR $q$ (if no signal source is actually present during
the observation, the procedure is equivalent to the Bonferroni procedure with
$\alpha = q$);

• define $c(m) = 1$ if the p-values are independent or positively correlated; otherwise
$c(m) = \sum_{j=1}^{m} 1/j$;

• determine the threshold $T(\mathbf{p}) = p_j$ by finding the largest index $j$ such that
$p_j \le j(q/m)/c(m)$, so that $p_k > k(q/m)/c(m)$ for every $k > j$ (see figure 1 for a
visual representation of this condition).

The above procedure with $c(m) = 1$ was shown [1] to control the expectation value§
of FDR at least at level $q$ when all $m$ tests are independent. It
was later proved to control FDR also when the tests are positively correlated [2] (for instance,
multivariate normal data whose covariance matrix has all positive elements). The
alternative definition of $c(m)$ given above controls FDR in the most general case [2], but
at the cost of reduced efficiency.
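
The steps listed above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names and the `general_dependence` flag are hypothetical, and the sketch assumes that a test is rejected when its p-value is less than or equal to the threshold, so that $p_j$ itself is counted among the claims.

```python
from typing import List, Tuple


def bh_threshold(pvalues: List[float], q: float,
                 general_dependence: bool = False) -> Tuple[float, int]:
    """Return (T, R): the BH threshold T(p) and the number of rejections R.

    T is the largest sorted p-value p_j with p_j <= j * q / (m * c(m)).
    If no p-value satisfies the condition, T = 0.0 and R = 0.
    """
    m = len(pvalues)
    # c(m) = 1 for independent or positively correlated tests,
    # otherwise the harmonic sum 1 + 1/2 + ... + 1/m.
    c = sum(1.0 / j for j in range(1, m + 1)) if general_dependence else 1.0
    threshold, rejections = 0.0, 0
    for j, p in enumerate(sorted(pvalues), start=1):
        if p <= j * q / (m * c):
            threshold, rejections = p, j  # keep the largest index that passes
    return threshold, rejections


def bh_reject(pvalues: List[float], q: float,
              general_dependence: bool = False) -> List[bool]:
    """Flag, in the original order, the tests whose null hypothesis is rejected."""
    threshold, rejections = bh_threshold(pvalues, q, general_dependence)
    if rejections == 0:
        return [False] * len(pvalues)
    return [p <= threshold for p in pvalues]
```

Note that the search keeps the largest index that passes: p-values below the threshold are rejected even if they individually lie above the line $k(q/m)/c(m)$, which corresponds to taking the highest intersection point in figure 1.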

There is a nice back-of-the-envelope plausibility argument, which can be found in
[5], for the simple case when signals are easily separable (e.g. signals with high
signal-to-noise ratios). In this case we expect their p-values to be very low, and correspondingly
in the cumulative histogram of p-values we shall see a step with height $S$ near $p \approx 0$, see
figure 1 (bottom right). We also see that there is only one intersection point for the BH
procedure, such that

    $T(\mathbf{p})/R = q/m$    (3)

On the other hand, the threshold $T(\mathbf{p})$ can be expressed on average by $B/m_0$ (this is a
special case of (1): each of the $m_0$ background tests falls below the threshold with
probability $T(\mathbf{p})$). Substituting this value in (3) we obtain

    $B/R = q\,m_0/m \le q$    (4)

For a rigorous proof see [2].
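
The plausibility argument can be checked numerically with a small Monte Carlo, sketched below under simplifying assumptions that are ours: background p-values are drawn independently and uniformly in (0, 1), the well-separated signals are idealised as p-values exactly equal to zero, and $c(m) = 1$. The function names, the seed and the chosen values of $m_0$, $m_1$ and $q$ are illustrative only.

```python
import random


def bh_rejections(pvalues, q):
    """Number of BH rejections for c(m) = 1 (largest j with p_(j) <= j*q/m)."""
    m = len(pvalues)
    rejections = 0
    for j, p in enumerate(sorted(pvalues), start=1):
        if p <= j * q / m:
            rejections = j
    return rejections


def mean_fdr(m0, m1, q, trials, seed=12345):
    """Average B/R over simulated experiments with well-separated signals."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        background = [rng.random() for _ in range(m0)]  # uniform under H0
        signals = [0.0] * m1                            # idealised signals, p ~ 0
        r = bh_rejections(background + signals, q)
        if r > 0:
            # Rejections proceed in ascending order of p-value, so the m1
            # zero p-values are rejected first; the rest are false claims.
            b = max(r - m1, 0)
            total += b / r
    return total / trials


if __name__ == "__main__":
    # Compare the simulated average with q * m0 / m from equation (4).
    print(mean_fdr(m0=45, m1=5, q=0.1, trials=5000), 0.1 * 45 / 50)
```

In this idealised setting the simulated average of $B/R$ should fall close to $q\,m_0/m$, within the statistical fluctuations of the Monte Carlo.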

§ Of course, the quantity FDR is a random variable, as well as the p-values. 
Figure 1. (a) The probability density function of p-values when data come from a
mixed model can be thought of as the sum of a uniform distribution (background) and a
biased one (signal). (b) The Benjamini-Hochberg (BH) procedure consists in plotting
the cumulative histogram of the p-values of the $m$ trials (continuous line) and looking
for intersections with a line drawn from the origin with slope equal to $m \cdot c(m)/q$
(dashed line). The null hypothesis is rejected for all data with p-value between 0 and
the abscissa of the highest intersection point.
(c) Sketch of the histogram of p-values and (d) corresponding cumulative histogram, in a
case of easily separable signals. The BH procedure applied to this case can easily be
shown to control FDR (see section 2.3 for details).



3. Numerical test of the method 

We now demonstrate this procedure with a simple example. Suppose we are given the
results of 50 counting experiments, labeled by the index $i$. Their background is modeled
as a Poisson random variable, with the same|| known expectation value for all $i$.

We consider two possible cases: in the first one, we draw 50 independent measures;
in the other, we generate correlation by summing neighboring bins (i.e., if $n_c^{i}$ represents
the independent counts in the $i$-th bin, then the 50 correlated counts ${n'}_c^{i}$ are defined as
${n'}_c^{i} = n_c^{i} + n_c^{i-1}$, where $n_c^{0} = n_c^{50}$). We investigated different background levels (from
$N_b = 0.01$ to $N_b = 50$) and different numbers of signals ($N_s$ from 0 to 6), assuming
for the sake of simplicity that each bin can have either one or zero counts due to true
signals.
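
The two background configurations can be sketched as follows. This is only an illustration, not the code used for the simulation: the function names are hypothetical, the Poisson generator is Knuth's multiplication method (adequate for the small expectation values considered here), and the wrap-around of the first bin onto the last one follows the definition given above.

```python
import math
import random


def poisson_sample(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def independent_counts(rng, mean, bins=50):
    """Independent background counts, one per bin."""
    return [poisson_sample(rng, mean) for _ in range(bins)]


def correlated_counts(counts):
    """n'_i = n_i + n_(i-1); the first bin wraps around to the last one."""
    return [counts[i] + counts[i - 1] for i in range(len(counts))]


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed, for reproducibility of the example
    independent = independent_counts(rng, mean=0.5)
    print(independent)
    print(correlated_counts(independent))
```

Each correlated count shares one term with each of its two neighbors, which is what induces the positive correlation between adjacent bins.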

In order to decide on the presence of a signal, we use the one-tail Poisson probability
for the expected number of counts in each bin. Table 2 shows the results of a Monte Carlo
simulation. For each configuration (differing by average background and
number of true signals) we compute the average number of claims $R$, i.e. the number of
bins for which the null hypothesis is rejected. We present the results for the Bonferroni and
BH tests, both tuned to bound the FWER at 1% when no signal is present.
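
A minimal sketch of the one-tail Poisson probability used as p-value is given below. It assumes that the tail includes the observed count, i.e. $p = \Pr(N \ge n)$ for $N$ distributed as a Poisson variable with the background expectation value; the function name is hypothetical.

```python
import math


def poisson_pvalue(n_observed: int, mean_background: float) -> float:
    """One-tail p-value Pr(N >= n_observed) for N ~ Poisson(mean_background)."""
    if n_observed <= 0:
        return 1.0
    term = math.exp(-mean_background)  # Pr(N = 0)
    below = 0.0
    for k in range(n_observed):        # accumulate Pr(N <= n_observed - 1)
        below += term
        term *= mean_background / (k + 1)
    # The subtraction loses precision for extremely small p-values; a dedicated
    # survival function would be preferable in that regime.
    return max(0.0, 1.0 - below)
```

For a background of 0.01 counts per bin, a single count already gives $p = 1 - e^{-0.01}$, slightly below 0.01, which illustrates why the discreteness of the statistic matters at low background.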

Both procedures work as expected, controlling the FWER and the FDR

|| To avoid degeneracy due to the discreteness of the test statistic (many results collapsing onto the same
p-values), we actually spread the background of the experiments over a range of ±1% around $N_b$.
Table 2. Results of the simulation described in section 3. The first column lists values
of $N_s$; the other columns refer to different values of $N_b$, as listed in the first row. Each
entry corresponding to a $\{N_s, N_b\}$ pair is composed of two values: the upper one refers
to the Bonferroni procedure, the lower one to the BH procedure. These values are averaged
over 40000 samples and the statistical precision is of the order of 0.005.

N_s  procedure    values for successive N_b columns
...
3    Bonferroni   3.005  0.001  0.003  0.006  0.013  0.032  0.069  0.021  0.006  0.005
     BH           3.010  3.019  0.004  0.006  0.013  0.032  0.069  0.021  0.006  0.005
4    Bonferroni   4.005  0.001  0.004  0.008  0.017  0.043  0.086  0.027  0.007  0.006
     BH           4.010  4.018  0.125  0.008  0.017  0.043  0.086  0.027  0.007  0.006
5    Bonferroni   5.004  0.002  0.005  0.009  0.020  0.053  0.106  0.029  0.008  0.006
     BH           5.009  5.018  5.046  0.009  0.020  0.053  0.106  0.029  0.008  0.006
6    Bonferroni   6.004  0.002  0.006  0.013  0.024  0.061  0.124  0.034  0.010  0.007
     BH           6.009  6.017  6.043  0.013  0.024  0.061  0.124  0.034  0.010  0.008





Table 3. Same as table 2 but for the case of correlated measures.

N_s  procedure    values for successive N_b columns
3    (incomplete) ... 0.001 ... 0.003 ... 0.012 ... 0.032 ... 0.022 ... 0.006 ... 0.005 / 0.006
4    (incomplete) 4.005 / 4.009   0.002 / 4.017   0.004 ... 0.017   0.043 ... 0.085   0.025 / 0.025   0.007 / 0.007   0.005 / 0.006
5    Bonferroni   5.004  0.002  0.005  0.010  0.019  0.051  0.107  0.029  0.008  0.007
     BH           5.009  5.018  5.047  0.010  0.019  0.051  0.108  0.029  0.008  0.007
6    Bonferroni   6.004  0.003  0.007  0.011  0.024  0.061  0.127  0.035  0.010  0.007
     BH           6.009  6.016  6.044  0.013  0.024  0.061  0.127  0.035  0.010  0.007



respectively at the desired level. For high background values they give similar results,
as expected. On the other hand, the efficiency of the Bonferroni procedure falls to zero
for $N_b > 0.01$, while the BH procedure is still effective up to $N_b = 0.05$ in this example.

Figure 2 shows how the BH procedure promptly picks up the signals
as the background level decreases (see also figure 1).
Figure 2. A few samples from the Monte Carlo simulation used to produce table 2 are displayed
in detail. They refer to $N_s = 5$, and the background is (a) $N_b = 50$, (b) $N_b = 0.5$, (c)
$N_b = 0.01$. In the plots, the cumulative histogram of the p-values is compared
with the thresholds given by the Bonferroni (- -) and the BH procedures.



4. Conclusions 

When multiple tests are tried on the same data set, controlling FDR seems in general
a wiser idea than just limiting type I errors. Robust but simple procedures exist which
(conservatively) control FDR in positively correlated tests, and also in the more general
case (but at the cost of reduced efficiency).

This idea is relatively new in the astrophysics community, and we are not aware of
any application in the GW community. Its application should be encouraged. Notice,
however, that the BH procedure is not the only one, and more complex (but approximate)
strategies have been investigated (see for instance [6, 7]).